LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Authors

Zhipu AI, Tsinghua University

Links:

LongBench v2

[2412.15204] LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Abstract:

This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at this https URL.

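Problems with current benchmarks

  1. Lack of deep reasoning: answers can be found directly in the source material, so the benchmarks do not reflect an LLM's deep understanding across different tasks.
  2. Unreliable evaluation metrics.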

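Design points:

  1. The context must be sufficiently long.
  2. Questions must require the model to understand the context before answering; the answer should not be stated directly in the context.
  3. The data should cover a wide range of long-text scenarios, reflecting the model's overall analytical ability.
  4. Each item consists of a long text, one question with four options, the answer, and an explanation.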
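The item format described in point 4 can be sketched as a small record type plus an exact-match accuracy function. This is a minimal sketch under stated assumptions: the class name `MCQItem`, its field names, and the helper `accuracy` are hypothetical and are not taken from the LongBench v2 release.

```python
from dataclasses import dataclass


@dataclass
class MCQItem:
    """One multiple-choice item (field names are illustrative assumptions)."""
    context: str             # the long text
    question: str
    choices: dict[str, str]  # option letter -> option text, four options
    answer: str              # gold option letter, e.g. "B"
    explanation: str = ""    # annotator's justification for the answer

    def __post_init__(self) -> None:
        if sorted(self.choices) != ["A", "B", "C", "D"]:
            raise ValueError("expected exactly the options A, B, C, D")
        if self.answer not in self.choices:
            raise ValueError("answer must be one of the option letters")


def accuracy(items: list[MCQItem], predictions: list[str]) -> float:
    """Fraction of items whose predicted letter equals the gold letter."""
    if len(items) != len(predictions):
        raise ValueError("need one prediction per item")
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper() == item.answer
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)
```

Because every question has four options, uniform random guessing gives an expected accuracy of 25%, which is a useful floor when reading the reported scores.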

The benchmark includes the following tasks:

[Figure: overview of the task categories]

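  1. Single-document QA: in-depth reasoning over the background material
  2. Multi-document QA
  3. Long in-context learning, e.g. user guide QA, new language translation, etc.
  4. Long-dialogue history understanding
  5. Code repository understanding, e.g. cross-file reasoning
  6. Long structured data understanding, e.g. table QA, complex queries over knowledge graphs (KG), etc.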

Experimental results:

[Figure: experimental results]

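Limitations:

  1. The benchmark is not large enough.
  2. The data is in English only.
  3. The task distribution is uneven across context lengths, so rankings can depend on length: model B may beat model A on short texts while A beats B on long texts.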
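The third limitation can be made concrete by reporting accuracy per context-length bucket instead of as a single number. The sketch below is illustrative only: the bucket names, the function `accuracy_by_bucket`, and the toy records are assumptions and are not data from the paper.

```python
from collections import defaultdict


def accuracy_by_bucket(records):
    """Map (bucket, is_correct) pairs to a {bucket: accuracy} dict."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for bucket, is_correct in records:
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up outcomes for two hypothetical models, two items per bucket.
model_a = [("short", True), ("short", False), ("long", True), ("long", True)]
model_b = [("short", True), ("short", True), ("long", False), ("long", True)]

acc_a = accuracy_by_bucket(model_a)
acc_b = accuracy_by_bucket(model_b)
```

In this toy data both models answer three of four items correctly overall, yet B leads on the short bucket and A on the long one, so an aggregate score depends on how many items fall into each bucket.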
